Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
Import Apache Spark 2.1libraries and setup environment variables. In case for cluster deployment you should probably remove this cell.
import os
import sys
import time
# Change to path where apache spark 2.x is downloaded
SPARK_PATH = '/users/suchy/Documents/spark-2.1.0-bin-hadoop2.7'
os.environ['SPARK_HOME'] = SPARK_PATH
SPARK_HOME = os.environ['SPARK_HOME']
#Add the following paths to the system path.
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.10.4-src.zip"))
Initialize Spark Context for driver program
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
.setAppName("Capital_Bike_weather_mining")
.set("spark.executor.memory", "4g")
.set("spark.driver.memory", "8g"))
#.setMaster("yarn") - for cluster deployment
sc = SparkContext.getOrCreate(conf = conf)
sqlContext = SQLContext(sc)
sc = sc.setCheckpointDir("checkpoint")
Import pandas, numpy, matplotlib,and seaborn. Then set %matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kstest
from sklearn import preprocessing
from scipy.stats import skew
from scipy.stats import boxcox
from scipy.special import inv_boxcox
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
We are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. We must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
</a>Read in the Customers Bike Rental csv train and test files as a Pandas DataFrame.
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")
There are 3 dependent variables: casual, registered and count (casual + registered) .
We will try both methods: predict amount of casual and regitered rentals separately and predict sum of those (count)
train_df.head()
test_df.head()
Check out customers rental info() and describe() methods. Thera are total 10866 entries for traing data and 6493 entries for test data, none of the column has missing values
train_df.info()
test_df.info()
continuous_var = ['temp', 'atemp', 'humidity', 'windspeed']
# exclude binary variables
categorical_var=['season','weather','year','month','hour','dayofweek','weekofyear','quarter','hour_cat', 'temp_cat']
dependent_variables = ['casual', 'registered', 'count']
dependent_variables_log = ['casual_log', 'registered_log', 'count_log']
dependent_variables_bc = ['casual_bc', 'registered_bc', 'count_bc']
# dependent_variables_cat used for ChiSqSelector, which works only for classified labels
dependent_variables_cat = ['casual_cat', 'registered_cat', 'count_cat']
added_cat_var = ['season_cat', 'month_cat', 'dayofweek_cat', 'weather_cat']
seasonDict = {1: "Winter", 2 : "Spring", 3 : "Summer", 4 :"Fall" }
monthDict = {1: "January", 2 : "February", 3 : "March", 4 :"April" ,\
5 : "May", 6 : "June", 7 : "July", 8 :"August" ,\
9: "September", 10 : "October", 11 : "November", 12 :"December"}
dayofweekDict = {0: "Monday", 1 : "Tuesday", 2 : "Wednesday", 3 :"Thursday" ,\
4 : "Friday", 5 : "Saturday", 6 : "Sunday"}
weatherDict = {1: "Clear", 2 : "Mist", 3 : "Light_Snow", 4 :"Heavy_Rain" }
Data preparation part: parse data timestamp, add dummy and derivated variables. For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels. So it's good idea to have categorical variables with similar level of unique values.
Lot's of data from the function below are taken in next section of Exploratory Data Analysis, like rental peak hours.
def set_derivated_vars(dataset):
customers_rental = dataset.copy()
# create derivate variable: day-to-day and hour change for continuous variables
for var in continuous_var:
#customers_rental[var+'_prev_hour_change_pct'] = customers_rental[var].pct_change(periods=1) * 100
# to make positive values means percentage increase of variable in next hour
customers_rental[var+'_next_hour_change_pct'] = customers_rental[var].pct_change(periods=-1)*(-1) * 100
#customers_rental[var+'_prev_day_change_pct'] = customers_rental[var].pct_change(periods=24) * 100
customers_rental[var+'_next_day_change_pct'] = customers_rental[var].pct_change(periods=-24) *(-1) * 100
customers_rental['atemp_temp_diff'] = customers_rental['atemp'] - customers_rental['temp']
# first day don't have pct_change, as well as there are some divide by zero operation (inf)
customers_rental = customers_rental.replace([np.inf, -np.inf], np.nan)
# Replace nan by interpolating, first argument can't be NaN for interpolation to works,
# In Pandas interpolate implementation NaN variable above the range
# will we interpolated as well (like extrapolation)
customers_rental.iloc[0] = customers_rental.iloc[0].fillna(customers_rental.mean())
customers_rental = customers_rental.interpolate(method='time')
customers_rental = set_temp_cat(customers_rental)
return customers_rental
def set_temp_cat(dataset):
customers_rental = dataset.copy()
# create additional variable about rental traffic by temp,
# specific hours are taken from Exploratoty Data Analysis from clustermap by the end of next section
customers_rental['temp_cat'] = 0
customers_rental.loc[(customers_rental['temp'].round() <= 12), 'temp_cat'] = 1
customers_rental.loc[(customers_rental['temp'].round() >= 13) & (customers_rental['temp'].round() <= 30) & \
(customers_rental['workingday'] == 0), 'temp_cat'] = 2
customers_rental.loc[(customers_rental['temp'].round() >= 13) & (customers_rental['temp'].round() <= 30) & \
(customers_rental['workingday'] == 1),'temp_cat'] = 3
customers_rental.loc[(customers_rental['temp'].round() >= 31), 'temp_cat'] = 4
return customers_rental
def set_hour_cat(dataset):
customers_rental = dataset.copy()
# create additional variable about rental traffic by hour,
# specific hours are taken from Exploratoty Data Analysis from clustermap by the end of next section
customers_rental['hour_cat'] = 1
customers_rental.loc[(customers_rental['hour'] <= 6) | (customers_rental['hour'] == 23), 'hour_cat'] = 2
customers_rental.loc[(customers_rental['hour'] >= 7) & (customers_rental['hour'] <= 8) & \
(customers_rental['workingday'] == 1), 'hour_cat'] = 3
customers_rental.loc[(customers_rental['hour'] >= 10) & (customers_rental['hour'] <= 16) & \
(customers_rental['workingday'] == 0), 'hour_cat'] = 3
customers_rental.loc[(customers_rental['hour'] >= 17) & (customers_rental['hour'] <= 18) & \
(customers_rental['workingday'] == 1), 'hour_cat'] = 3
customers_rental.loc[((customers_rental['hour'] >= 20) & (customers_rental['hour'] <= 22)) | \
(customers_rental['hour'] == 9), 'hour_cat'] = 4
return customers_rental
def set_independent_weather(dataset):
customers_rental = dataset.copy()
#############################################
# data taken from independent source weather
# https://www.washingtonpost.com/blogs/capital-weather-gang/post/2011-in-washington-dc-warm-with-extreme-weather-aplenty/2011/12/27/gIQAnNXSMP_blog.html?utm_term=.7a2584169af6
# https://www.washingtonpost.com/blogs/capital-weather-gang/post/top-5-dc-weather-events-of-2012/2012/12/28/d384311c-4f0e-11e2-950a-7863a013264b_blog.html?utm_term=.1a10a3265c33
#############################################
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 15), "workingday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 16), "workingday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 15), "holiday"] = 0
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 16), "holiday"] = 0
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 25), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 23), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 25), "workingday"] = 0
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 23), "workingday"] = 0
customers_rental.loc[(customers_rental['month'] == 12) \
& ((customers_rental['day'] == 24) | \
(customers_rental['day'] == 26)), "workingday"] = 0
customers_rental.loc[(customers_rental['month'] == 12) \
& ((customers_rental['day'] == 24) | \
(customers_rental['day'] == 25) | \
(customers_rental['day'] == 26)), "holiday"] = 1
# 2011
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 1) \
& (customers_rental['day'] == 26), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 9) \
& (customers_rental['day'] >= 4) & (customers_rental['day'] <= 8), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 8) \
& (customers_rental['day'] == 28), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 4) \
& (customers_rental['day'] >= 27) & (customers_rental['day'] <= 28), "holiday"] = 1
# 2012
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 10) \
& (customers_rental['day'] == 30), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 3) \
& (customers_rental['day'] == 2), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 6) \
& (customers_rental['day'] == 1), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 6) \
& (customers_rental['day'] == 29), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 5) \
& (customers_rental['day'] == 21), "holiday"] = 1
return customers_rental
def prepare_data(dataset):
customers_rental = dataset.copy()
# next_hour_weather is category for next hour weather
next_hour_weather = customers_rental['weather'].copy()
next_hour_weather[:len(next_hour_weather)-1] = next_hour_weather[1:len(next_hour_weather)]
customers_rental['next_hour_weather'] = next_hour_weather.reset_index()['weather']
dt = pd.DatetimeIndex(customers_rental['datetime'])
customers_rental.set_index(dt, inplace=True)
customers_rental['date'] = pd.to_datetime(customers_rental['datetime'])
customers_rental["year"] = customers_rental["date"].dt.year
customers_rental["month"] = customers_rental["date"].dt.month
customers_rental["day"] = customers_rental["date"].dt.day
customers_rental["hour"] = customers_rental["date"].dt.hour
customers_rental["dayofweek"] = customers_rental["date"].dt.dayofweek
customers_rental["weekofyear"] = customers_rental["date"].dt.weekofyear
customers_rental["season_cat"] = customers_rental["season"].map(seasonDict)
customers_rental["month_cat"] = customers_rental["month"].map(monthDict)
customers_rental["dayofweek_cat"] = customers_rental["dayofweek"].map(dayofweekDict)
customers_rental["weather_cat"] = customers_rental["weather"].map(weatherDict)
customers_rental['quarter'] = 0
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 0), 'quarter'] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 3), 'quarter'] = 2
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 6), 'quarter'] = 3
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 9), 'quarter'] = 4
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 0), 'quarter'] = 5
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 3), 'quarter'] = 6
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 6), 'quarter'] = 7
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 9), 'quarter'] = 8
# from EDA
customers_rental['bike_season'] = 0
customers_rental.loc[(customers_rental['month'] >= 4) & (customers_rental['month'] <= 11), 'bike_season'] = 1
# create derivated variables, used among others in outliers detecion
customers_rental = set_derivated_vars(customers_rental)
customers_rental = set_hour_cat(customers_rental)
customers_rental = set_independent_weather(customers_rental)
# train data has days 1-19 of each month, test data > 20
customers_rental['train'] = 0
customers_rental.loc[(customers_rental['day'] < 20), 'train'] = 1
# Apache Spark solver uses square loss function for minimalization.
# Our evalutation metric will be RMSLE so it's good idea to use logarithmic transformation
# of dependent cols (adding 1 first so that 0 values don't become -inf)
#
# Dependent_variables_cat used for ChiSqSelector, which works only for classified labels
# lambda dictionary necessery for inverted boxcox transformation
bc_lambda_dict = {}
for col in dependent_variables:
customers_rental.loc[(customers_rental['day'] >= 20), col] = 0
customers_rental['%s_log' % col] = np.log(customers_rental[col] + 1)
# boxcox transformation for 4 subsets in order to get better normal-like data distribution in each subset
# subsets are splitted by similarity on hour's impact on bike's rental
bc1 = boxcox(customers_rental[(customers_rental['hour_cat'] == 1) & (all_no_dummy['train'] == 1)][col] +1)
bc2 = boxcox(customers_rental[(customers_rental['hour_cat'] == 2) & (all_no_dummy['train'] == 1)][col] +1)
bc3 = boxcox(customers_rental[(customers_rental['hour_cat'] == 3) & (all_no_dummy['train'] == 1)][col] +1)
bc4 = boxcox(customers_rental[(customers_rental['hour_cat'] == 4) & (all_no_dummy['train'] == 1)][col] +1)
customers_rental['%s_bc' % col] = 0
customers_rental.loc[(customers_rental['hour_cat'] == 1) & (all_no_dummy['train'] == 1), \
'%s_bc' % col] = bc1[0]
customers_rental.loc[(customers_rental['hour_cat'] == 2) & (all_no_dummy['train'] == 1), \
'%s_bc' % col] = bc2[0]
customers_rental.loc[(customers_rental['hour_cat'] == 3) & (all_no_dummy['train'] == 1), \
'%s_bc' % col] = bc3[0]
customers_rental.loc[(customers_rental['hour_cat'] == 4) & (all_no_dummy['train'] == 1), \
'%s_bc' % col] = bc4[0]
customers_rental['%s_cat' % col] = 0
customers_rental.loc[(customers_rental['day'] < 20), '%s_cat' % col] = \
pd.cut(customers_rental.loc[(customers_rental['day'] < 20), '%s_bc' % col],\
bins=96, labels=range(0,96), include_lowest=True)
bc_lambda_dict[col] = [bc1[1], bc2[1], bc3[1], bc4[1]]
# drop unnecessary variables
customers_rental.drop(['datetime','date'], axis=1, inplace=True)
return customers_rental, bc_lambda_dict
We will use dataset with dummy variables for linear regression We will use dataset without dummy varaibles for random forest regression and gradient boosted regression. bc_lambda_dict is dictionary for lambdas used in BoxCox transformation, and used for inverted BoxCox transformation.
all_no_dummy, bc_lambda_dict = prepare_data(train_df.append(test_df))
Columns structure (without dummies) after data preparation part. Later we will explore which set of those columns use for machine learning models.
all_no_dummy.info()
Let's explore the data!
sns.set_palette("coolwarm")
sns.set_style('whitegrid')
Median of continuous variables looks reasonably. There are many outliers in windspeed variable and some outliers for zero humidity.
fig,(ax1,ax2)= plt.subplots(ncols=2, figsize=(12,4))
sns.boxplot(data=all_no_dummy[continuous_var], ax=ax1)
sns.boxplot(data=all_no_dummy[all_no_dummy['train'] == 1][dependent_variables], ax=ax2)
Let's find out what is distribution curves for dependent variables. Most of the current machine learning algorithmics performs best on normally distributed data. Pure distribution of depednent variables look more to be Poisson than Gaussian. Log transformation can helps a lot in converting data to normal-like distribution but BoxCox method do the best job by automatically detect proportion (lambda) of log and sqrt transformation.
fig,axes= plt.subplots(nrows=3, ncols=2, figsize=(11,10))
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['count'], bins = 24, ax=axes[0][0])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['count_log'], bins = 24, ax=axes[0][1], color='blue')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['casual'], bins = 24, ax=axes[1][0])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['casual_log'], bins = 24, ax=axes[1][1], color='blue')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['registered'], bins = 24, ax=axes[2][0])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['registered_log'], bins = 24, ax=axes[2][1], color='blue')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
We can see that LOG distribution looks more normal-like but it's clearly skewed. BoxCox helps us deal with it. But first let's examine raw data divided into 4 hour categories.
fig,axes= plt.subplots(nrows=3, ncols=4, figsize=(12,10))
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 1) & (all_no_dummy['train'] == 1)]['registered'], bins = 24, ax=axes[0][0], axlabel='Registered, HOUR_CAT = 1', color='orange')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 2) & (all_no_dummy['train'] == 1)]['registered'], bins = 24, ax=axes[0][1], axlabel='Registered, HOUR_CAT = 2', color='red')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 3) & (all_no_dummy['train'] == 1)]['registered'], bins = 24, ax=axes[0][2], axlabel='Registered, HOUR_CAT = 3', color='green')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 4) & (all_no_dummy['train'] == 1)]['registered'], bins = 24, ax=axes[0][3], axlabel='Registered, HOUR_CAT = 4', color='blue')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 1) & (all_no_dummy['train'] == 1)]['casual'], bins = 24, ax=axes[1][0], axlabel='Casual, HOUR_CAT = 1', color='orange')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 2) & (all_no_dummy['train'] == 1)]['casual'], bins = 24, ax=axes[1][1], axlabel='Casual, HOUR_CAT = 2', color='red')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 3) & (all_no_dummy['train'] == 1)]['casual'], bins = 24, ax=axes[1][2], axlabel='Casual, HOUR_CAT = 3', color='green')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 4) & (all_no_dummy['train'] == 1)]['casual'], bins = 24, ax=axes[1][3], axlabel='Casual, HOUR_CAT = 4', color='blue')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 1) & (all_no_dummy['train'] == 1)]['count'], bins = 24, ax=axes[2][0], axlabel='Count, HOUR_CAT = 1', color='orange')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 2) & (all_no_dummy['train'] == 1)]['count'], bins = 24, ax=axes[2][1], axlabel='Count, HOUR_CAT = 2', color='red')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 3) & (all_no_dummy['train'] == 1)]['count'], bins = 24, ax=axes[2][2], axlabel='Count, HOUR_CAT = 3', color='green')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 4) & (all_no_dummy['train'] == 1)]['count'], bins = 24, ax=axes[2][3], axlabel='Count, HOUR_CAT = 4', color='blue')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
Data are highly skewed. It's time to use BoxCox transformation.
fig,axes= plt.subplots(nrows=3, ncols=4, figsize=(12,10))
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 1) & (all_no_dummy['train'] == 1)]['registered_bc'], bins = 24, ax=axes[0][0], axlabel='Registered, HOUR_CAT = 1', color='orange')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 2) & (all_no_dummy['train'] == 1)]['registered_bc'], bins = 24, ax=axes[0][1], axlabel='Registered, HOUR_CAT = 2', color='red')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 3) & (all_no_dummy['train'] == 1)]['registered_bc'], bins = 24, ax=axes[0][2], axlabel='Registered, HOUR_CAT = 3', color='green')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 4) & (all_no_dummy['train'] == 1)]['registered_bc'], bins = 24, ax=axes[0][3], axlabel='Registered, HOUR_CAT = 4', color='blue')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 1) & (all_no_dummy['train'] == 1)]['casual_bc'], bins = 24, ax=axes[1][0], axlabel='Casual, HOUR_CAT = 1', color='orange')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 2) & (all_no_dummy['train'] == 1)]['casual_bc'], bins = 24, ax=axes[1][1], axlabel='Casual, HOUR_CAT = 2', color='red')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 3) & (all_no_dummy['train'] == 1)]['casual_bc'], bins = 24, ax=axes[1][2], axlabel='Casual, HOUR_CAT = 3', color='green')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 4) & (all_no_dummy['train'] == 1)]['casual_bc'], bins = 24, ax=axes[1][3], axlabel='Casual, HOUR_CAT = 4', color='blue')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 1) & (all_no_dummy['train'] == 1)]['count_bc'], bins = 24, ax=axes[2][0], axlabel='Count, HOUR_CAT = 1', color='orange')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 2) & (all_no_dummy['train'] == 1)]['count_bc'], bins = 24, ax=axes[2][1], axlabel='Count, HOUR_CAT = 2', color='red')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 3) & (all_no_dummy['train'] == 1)]['count_bc'], bins = 24, ax=axes[2][2], axlabel='Count, HOUR_CAT = 3', color='green')
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == 4) & (all_no_dummy['train'] == 1)]['count_bc'], bins = 24, ax=axes[2][3], axlabel='Count, HOUR_CAT = 4', color='blue')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
Right now only casual rental in hour_cat = 2 (from 23 p.m till 6 a.m) are skewed toward zero. Rest of the distributions look good. Let's test those distribution with Kolmogorov-Smirnov Test. P-Value above 0.05 will reject distribution as to be Gaussian one.
for col in dependent_variables:
print("Kolmogorov-Smirnov Test for " + col)
for hour_cat in range(1,5):
print("Hour_cat " + str(hour_cat) + ": " + str(kstest(all_no_dummy[(all_no_dummy['hour_cat'] == hour_cat) & (all_no_dummy['train'] == 1)]['%s_bc' % col], 'norm')))
print()
All distribution passed test, let's find out what is skew for those distributions. The closer to zero the better.
for col in dependent_variables:
print("Distribution skew for " + col)
for hour_cat in range(1,5):
print("Hour_cat " + str(hour_cat) + ": " + str(skew(all_no_dummy[(all_no_dummy['hour_cat'] == hour_cat) & (all_no_dummy['train'] == 1)]['%s_bc' % col])))
print()
Let's compare those results skews with raw data.
for col in dependent_variables:
print("Distribution skew for " + col)
for hour_cat in range(1,5):
print("Hour_cat " + str(hour_cat) + ": " + str(skew(all_no_dummy[(all_no_dummy['hour_cat'] == hour_cat) & (all_no_dummy['train'] == 1)][col])))
print()
Looks like we did big improvement. Good work.
Explore data of daily average of one hour bike rentals across all years 2011 and 2012. Sliding windows is of interval 7 days for dependent variables and 3 days for continuous independent variables. Default are set to 75 percentage change computed as (a-b)/b.
outliers = {}
def plot_with_window(column, dataset=all_no_dummy, all=True, threshold=75):
if (all):
day_avg = 3
data = dataset
data_window_mean = data.rolling(window=day_avg * 24).mean()
data_window_std = data.rolling(window=day_avg * 24).std()
title = column + ' over time'
else:
day_avg = 7
data = dataset[all_no_dummy['train'] == 1].groupby(['year','month','day']).mean()
data_window_mean = data.rolling(window=day_avg).mean()
data_window_std = data.rolling(window=day_avg).std()
title = 'Daily mean of ' + column + ' over time'
diff = np.abs(data[column] - data_window_mean[column])
# diff for outliers candidates has to be at least above threshold % to avoid false outliers
outliers = diff[(diff > 3 * data_window_std[column]) & (diff > threshold)]
plt.figure(figsize=(9,4))
data[column].plot()
data_window_mean[column].plot(label=str(day_avg) + ' Day Avg', lw=1, ls='--', c='red')
plt.legend(loc='best')
plt.title(title)
plt.tight_layout()
return outliers
def unique_outliers(col_list):
outliers_list = []
for col in col_list:
for timestamp in outliers[col].index:
outliers_list.append(timestamp)
return sorted(list(set(outliers_list)))
Characterics of casual renting is very changeable with many positive peaks.
for col in dependent_variables:
outliers[col] = plot_with_window(col, all=False, threshold=10)
Plot for registred users looks almost the same becouse on average about 80% of total rentals are made be registered users.
all_no_dummy[all_no_dummy['train'] == 1][['casual', 'registered', 'count']].describe().ix['mean']
Thera are no outliers for dependent variables (by our criteria).
all_no_dummy.ix[unique_outliers(dependent_variables)]
Let's find out whether there are some weather data outliers.
temp_var_list = ['temp',
#'temp_prev_hour_change_pct',
'temp_next_hour_change_pct',
#'temp_prev_day_change_pct',
'temp_next_day_change_pct']
for col in temp_var_list:
outliers[col] = plot_with_window(col)
Some of the outliers below.
all_no_dummy.ix[unique_outliers(temp_var_list), temp_var_list].head()
atemp_var_list = ['atemp',
#'atemp_prev_hour_change_pct',
'atemp_next_hour_change_pct',
#'atemp_prev_day_change_pct',
'atemp_next_day_change_pct']
for col in atemp_var_list:
outliers[col] = plot_with_window(col)
all_no_dummy.ix[unique_outliers(atemp_var_list)][atemp_var_list].head()
outliers['atemp_temp_diff'] = plot_with_window('atemp_temp_diff', threshold=10)
all_no_dummy.ix[unique_outliers(['atemp_temp_diff'])]['atemp_temp_diff'].head(400)
We can see one clear outlier in humidity. During minium day for count variable (2011, 3, 6), humidity had one of its highest value. It matches our understanding - there are fewer bike rentals during rain. On (2011, 3, 10) humidity has zero daily mean value although in adjacent days there was high value for humidity. It looks like some missing values.
all_no_dummy.groupby(['year','month','day']).mean().idxmin()
humidity_var_list = ['humidity',
#'humidity_prev_hour_change_pct',
'humidity_next_hour_change_pct',
#'humidity_prev_day_change_pct',
'humidity_next_day_change_pct']
for col in humidity_var_list:
outliers[col] = plot_with_window(col)
all_no_dummy.ix[unique_outliers(humidity_var_list)][humidity_var_list].head()
Windspeed becouse of its nature has very inregular characteriscs but in a way it has some stable mean level. Looks like this variable doesn't add value for our predictive models.
windspeed_var_list = ['windspeed',
#'windspeed_prev_hour_change_pct',
'windspeed_next_hour_change_pct',
#'windspeed_prev_day_change_pct',
'windspeed_next_day_change_pct']
for col in windspeed_var_list:
outliers[col] = plot_with_window(col)
all_no_dummy.ix[unique_outliers(windspeed_var_list)][windspeed_var_list].head()
len(unique_outliers(humidity_var_list + temp_var_list + atemp_var_list + windspeed_var_list + ['atemp_temp_diff']))
def interpolate_data(dataset):
data = dataset.copy()
for col_list in [humidity_var_list] + [temp_var_list] + [atemp_var_list] + [windspeed_var_list]:
for col in col_list:
for timestamp in outliers[col].index:
data.loc[timestamp, col_list] = np.NaN
# avoid first row NaN
data.iloc[0] = data.iloc[0].fillna(data.mean())
data = data.interpolate(method='time')
# after interpolating continuous varialbes we must recompute derivated variables
data = set_derivated_vars(data)
data = set_temp_cat(data)
return data
all_no_dummy_interpolate = interpolate_data(all_no_dummy.copy())
Difference between original and interpolated data.
(all_no_dummy != all_no_dummy_interpolate).sum()
Some exmaple:
2012-01-10 05:00:00 - there were outliers for temp (16.40) and atemp (20.46). Those values were interpolated to (12.77) and (15.26) respectively.
all_no_dummy.loc[('2012-01-10 02:00:00'):('2012-01-10 07:00:00')][continuous_var]
all_no_dummy_interpolate.ix[('2012-01-10 02:00:00'):('2012-01-10 07:00:00')][continuous_var]
From the boxplot below its clear most bike rentals are in hours: 8, 17, 18. We should expect that registred users rent bikes in those hours becouse there are almost no outliers. Between those hours there are hours with many outliers. We should expect that more casual users rent bikes in those hours
plt.figure(figsize=(10,6))
sns.boxplot(x="hour", y="count", data=all_no_dummy[all_no_dummy['train'] == 1])
plt.tight_layout
On the plots below we can see rentals distribution for specifc month. Druing summer there are most demand for bikes. During all season hour-rental characteristics have the same shape: for working days thera are two peaks on 8 and 17-18 hours. Casual users rent bike usually between those hours as well as during weekends.
# below code is taken from https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile
fig,(ax1,ax2,ax3,ax4)= plt.subplots(nrows=4, figsize=(10,20))
#fig.set_size_inches(12,15)
sortOrder = ["January","February","March","April","May","June","July","August","September","October","November","December"]
hueOrder = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
monthAggregated = pd.DataFrame(all_no_dummy[all_no_dummy['train'] == 1].groupby("month_cat")["count"].mean()).reset_index()
monthSorted = monthAggregated.sort_values(by="count",ascending=False)
sns.barplot(data=monthSorted,x="month_cat",y="count",ax=ax1,order=sortOrder)
ax1.set(xlabel='Month', ylabel='Avearage Count',title="Average Count By Month")
hourAggregated = pd.DataFrame(all_no_dummy[all_no_dummy['train'] == 1].groupby(["hour","season_cat"],sort=True)["count"].mean()).reset_index()
sns.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["season_cat"], data=hourAggregated, join=True,ax=ax2)
ax2.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Season",label='big')
hourAggregated = pd.DataFrame(all_no_dummy[all_no_dummy['train'] == 1].groupby(["hour","dayofweek_cat"],sort=True)["count"].mean()).reset_index()
sns.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["dayofweek_cat"],hue_order=hueOrder, data=hourAggregated, join=True,ax=ax3)
ax3.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Weekdays",label='big')
hourTransformed = pd.melt(all_no_dummy[all_no_dummy['train'] == 1][["hour","casual","registered"]], id_vars=['hour'], value_vars=['casual', 'registered'])
hourAggregated = pd.DataFrame(hourTransformed.groupby(["hour","variable"],sort=True)["value"].mean()).reset_index()
sns.pointplot(x=hourAggregated["hour"], y=hourAggregated["value"],hue=hourAggregated["variable"],hue_order=["casual","registered"], data=hourAggregated, join=True,ax=ax4)
ax4.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across User Type",label='big')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
There are not clear linear relationships between variables, besides dependent variables (casual, registerd, count) themselves and temp - atemp. It indicates that machine learning regression algorithms that can handle non-linearity could perfom better than linear regression.
sns.pairplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], vars=continuous_var + dependent_variables, hue='holiday', palette='coolwarm')
There is some correlation (showed on some next cells on heatmap) between casual and registred users rentals.
sns.pairplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], vars=['casual', 'registered'], palette='coolwarm')
It's clear linear relationships between temperature in Celsius and "feels like" temperature in Celsius'. We should drop one of them to avoid multicollinearity
sns.jointplot(x='temp',y='atemp',data=all_no_dummy_interpolate)
There are potentially 25 outliers where absolute difference between temperature in Celsius and "feels like" temperature in Celsius is greater than 5 Celsius degree. We can build another linear regression model to predict atemp for those outliers. But taking into account fact, there are only 24 records, we just simply set mean() on atemp_temp_diff variable for those records and we will drop atemp variable as well.
sns.jointplot(x='temp',y='atemp_temp_diff',data=all_no_dummy_interpolate)
all_no_dummy_interpolate[all_no_dummy_interpolate['atemp_temp_diff'] < -10][continuous_var + ['atemp_temp_diff']]
all_no_dummy_interpolate.loc[all_no_dummy_interpolate['atemp_temp_diff'] < -10 , \
['atemp','atemp_temp_diff']] = np.NaN
all_no_dummy.loc[all_no_dummy['atemp_temp_diff'] < -10 , \
['atemp','atemp_temp_diff']] = np.NaN
all_no_dummy_interpolate = all_no_dummy_interpolate.interpolate(method='time')
all_no_dummy = all_no_dummy.interpolate(method='time')
all_no_dummy_interpolate = set_derivated_vars(all_no_dummy_interpolate)
all_no_dummy = set_derivated_vars(all_no_dummy)
Right now there are no outliers for atemp_temp_diff variable.
all_no_dummy_interpolate[all_no_dummy_interpolate['atemp_temp_diff'] < -10][continuous_var + ['atemp_temp_diff']]
TRAIN data set - categorical variables.
fig,axes= plt.subplots(nrows=4, ncols=2, figsize=(11,10))
sns.countplot(x='holiday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[0][0])
sns.countplot(x='season_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[0][1])
sns.countplot(x='workingday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[1][0])
sns.countplot(x='weather_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[1][1])
sns.countplot(x='dayofweek',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[2][0])
sns.countplot(x='month',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[2][1])
sns.countplot(x='hour',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[3][0])
sns.countplot(x='temp_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[3][1])
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
TEST data set - categorical variables.
fig,axes= plt.subplots(nrows=4, ncols=2, figsize=(11,10))
sns.countplot(x='holiday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[0][0])
sns.countplot(x='season_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[0][1])
sns.countplot(x='workingday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[1][0])
sns.countplot(x='weather_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[1][1])
sns.countplot(x='dayofweek',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[2][0])
sns.countplot(x='month',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[2][1])
sns.countplot(x='hour',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[3][0])
sns.countplot(x='temp_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[3][1])
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
Mainly difference between Train and Test caterogical variables distribution is a smaller amount of rows in February for test data set - it's due to fact in 2011 February had 28 days and in 2012 had 29 days.
TRAIN data set - continuous variables.
fig,axes= plt.subplots(nrows=4, ncols=4, figsize=(11,10))
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['temp'], bins=24, ax=axes[0][0], axlabel='TRAIN: temp', color='orange')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['temp'], bins=24, ax=axes[1][0], axlabel='TRAIN: temp_interpolate', color='orange')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['temp'], bins=24, ax=axes[2][0], axlabel='TEST: temp', color='orange')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['temp'], bins=24, ax=axes[3][0], axlabel='TEST: temp_interpolate', color='orange')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['atemp'], bins=24, ax=axes[0][1], axlabel='TRAIN: atemp', color='red')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['atemp'], bins=24, ax=axes[1][1], axlabel='TRAIN: atemp_interpolate', color='red')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['atemp'], bins=24, ax=axes[2][1], axlabel='TEST: atemp', color='red')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['atemp'], bins=24, ax=axes[3][1], axlabel='TEST: atemp_interpolate', color='red')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['humidity'], bins=24, ax=axes[0][2], axlabel='TRAIN: humidity', color='green')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['humidity'], bins=24, ax=axes[1][2], axlabel='TRAIN: humidity_interpolate', color='green')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['humidity'], bins=24, ax=axes[2][2], axlabel='TEST: humidity', color='green')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['humidity'], bins=24, ax=axes[3][2], axlabel='TEST: humidity_interpolate', color='green')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['windspeed'], bins=24, ax=axes[0][3], axlabel='TRAIN: windspeed', color='blue')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['windspeed'], bins=24, ax=axes[1][3], axlabel='TRAIN: windspeed_interpolate', color='blue')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['windspeed'], bins=24, ax=axes[2][3], axlabel='TEST: windspeed', color='blue')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['windspeed'], bins=24, ax=axes[3][3], axlabel='TEST: windspeed_interpolate', color='blue')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
We can see that both TRAIN and TEST data sets have very similar distribution for continuous variablers so overall there is no need to select specific subset from training data to train models on.
all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].corr().ix[dependent_variables]
Heatmap is better visualisation tool to see some Pearson Correlation. There aren't strong correlations: hour and temp variable have approximately 0.4 Pearson Correlation value.
fig = plt.figure(figsize=(20,12))
sns.heatmap(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].corr(), annot=False,cmap='coolwarm')
plt.tight_layout
We can see that linear regression can give us only roughly approximation of total rentals from hour variable.
sns.jointplot(x='hour',y='count',kind='reg',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1])
Temp-Count scatter plot is skewed in right direction which indicates that higher temperature results in higher total bike rentals. But again, we can see it's only roughly approximation.
sns.jointplot(x='temp',y='count',kind='reg',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1])
sns.jointplot(x='temp',y='count',kind='hex',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1])
Let's have a close look on ditribution across different years.
train_no_dummy_year_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].pivot_table(values='count',index='month',columns='year').rename(index=monthDict)
train_no_dummy_year_piv
It looks like there are much more rentals in 2012 than in 2011. Perhaps there were more bikes to rent in 2012 than in 2011. Another reason could be fact that people got used to rent a bike from Capital BikeShare and there were more registred users in total.
fig = plt.figure(figsize=(7,5))
sns.heatmap(train_no_dummy_year_piv,cmap='coolwarm')
train_no_dummy_month_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].pivot_table(values='count',index='hour',columns='month').rename(columns=monthDict)
We can see that from April to November there is a bike season.
fig = plt.figure(figsize=(11,8))
sns.heatmap(train_no_dummy_month_piv,cmap='coolwarm')
train_no_dummy_workingday_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].pivot_table(values='count_cat',index='hour',columns=['workingday'])
sns.clustermap(train_no_dummy_workingday_piv,cmap='coolwarm',standard_scale=1,method='average')
From the plot clustermap above we can devide hours into 4 categories:
</ol> </ol> We added those categories in variable hour_cat.
train_no_dummy_year_temp_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].round().pivot_table(values='count_cat',index='temp',columns=['workingday']).dropna()
sns.clustermap(train_no_dummy_year_temp_piv,cmap='coolwarm',standard_scale=1,method='average')
From the plot clustermap above we can devide rounded temp into 3 categories:
We added those categories in variable temp_cat.
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$Root Mean Squared Log Error (RMSLE) is the square root of the mean of the squared log errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(log(y_i+1)-log(\hat{y}_i+1))^2}$$Comparing these metrics:
All of these are loss functions, because we want to minimize them.
def evaluatorMAE_own(predictions, labelCol):
diff = np.abs(predictions[labelCol] - predictions['prediction'])
mean_error = diff.mean()
return mean_error
def evaluatorMSE_own(predictions, labelCol):
diff = predictions[labelCol] - predictions['prediction']
mean_error = np.square(diff).mean()
return mean_error
def evaluatorRMSE_own(predictions, labelCol):
diff = predictions[labelCol] - predictions['prediction']
mean_error = np.square(diff).mean()
return np.sqrt(mean_error)
def evaluatorRMSLE_own(predictions, labelCol):
diff = np.log(predictions[labelCol] + 1) - np.log(predictions['prediction'] + 1)
mean_error = np.square(diff).mean()
return np.sqrt(mean_error)
def evaluatorR2_own(predictions, labelCol):
SS_res = np.sum(np.square(predictions[labelCol] - predictions['prediction']))
SS_tot = np.sum(np.square(predictions[labelCol] - predictions[labelCol].mean()))
R2 = 1 - SS_res/SS_tot
return R2
def undo_log_transform(pred,labelCol, Kaggle=False):
predictions = pred.copy()
# Kaggle test data doesn't have labels
if (not Kaggle):
predictions[labelCol] = np.exp(predictions[labelCol]) - 1
predictions['prediction'] = np.exp(predictions['prediction']) - 1
return predictions
def evaluateMetrics(preds, labelCol, rollback=True, rounded=True):
predictions = preds.copy()
if (rollback):
predictions = undo_log_transform(predictions, labelCol)
if (rounded):
predictions = np.round(predictions)
print("MAE: " + str('%.4f' % evaluatorMAE_own(predictions, labelCol)))
print("MSE: " + str('%.4f' % evaluatorMSE_own(predictions, labelCol)))
print("RMSE: " + str('%.4f' % evaluatorRMSE_own(predictions, labelCol)))
print("R2: " + str('%.4f' % evaluatorR2_own(predictions, labelCol)))
print("----------------")
print("RMSLE: " + str('%.4f' % evaluatorRMSLE_own(predictions, labelCol)))
print("----------------")
return evaluatorRMSLE_own(predictions, labelCol)
Drop unnecessary added category variables from training dataset. Apache Spark needs all data to be numerical. Later on we will perform feature selection to chose best variables for our models.
all_no_dummy_interpolate = all_no_dummy_interpolate.drop(added_cat_var, axis=1)
all_dummy_interpolate = pd.get_dummies(data = all_no_dummy_interpolate.copy(), \
columns = categorical_var, drop_first=True)
Split trainind data into train and "test" to evaluate our models. Final model will predict labels for test_KAGGLE datasets.
from sklearn.model_selection import train_test_split
train_dummy = all_dummy_interpolate[all_dummy_interpolate['train'] == 1]
(trainingData_dummy, testData_dummy) = train_test_split(train_dummy, test_size=0.2, random_state=101)
(trainingData_no_dummy, testData_no_dummy) = \
train_test_split(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], test_size=0.2, random_state=101)
Create Apache Spark Data Frames
spark_train_dummy = sqlContext.createDataFrame(trainingData_dummy)
spark_test_dummy = sqlContext.createDataFrame(testData_dummy)
spark_train_no_dummy = sqlContext.createDataFrame(trainingData_no_dummy)
spark_test_no_dummy = sqlContext.createDataFrame(testData_no_dummy)
Prepare Kaggle test data.
test_dummy = all_dummy_interpolate[all_dummy_interpolate['train'] != 1]
# reset index for including datetime as column
spark_Kaggle_test_dummy = sqlContext.createDataFrame(test_dummy.reset_index())
spark_Kaggle_test_no_dummy = sqlContext.createDataFrame(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1].reset_index())
Apache Spark Machine Learning lib requires input as data frame transformed to labeled point.
from pyspark.ml.linalg import Vectors
DUMMY = True
columns_to_hide = added_cat_var + dependent_variables + dependent_variables_log \
+ dependent_variables_cat + ['train','index']
dummy_cols = [x for x in spark_train_dummy.columns if x not in columns_to_hide]
no_dummy_cols = [x for x in spark_train_no_dummy.columns if x not in columns_to_hide]
def transformToLabeledPoint(row) :
retArray=[]
if (DUMMY):
spark_col = dummy_cols
else:
spark_col = no_dummy_cols
for col in spark_col:
retArray.append(row[col])
label_count = row["count_log"]
label_registered = row["registered_log"]
label_casual = row["casual_log"]
return label_count, label_registered, label_casual, Vectors.dense(retArray)
def transformToLabeledPointKaggle(row) :
retArray=[]
if (DUMMY):
spark_col = dummy_cols
else:
spark_col = no_dummy_cols
for col in spark_col:
retArray.append(row[col])
datetime = row["index"]
# lables must match this from transformToLabeledPoint for further unionAll() operation
return datetime,0,0, Vectors.dense(retArray)
def transformToLabeledPointChiSqSelector(row) :
retArray=[]
if (DUMMY):
spark_col = dummy_cols
else:
spark_col = no_dummy_cols
for col in spark_col:
retArray.append(row[col])
label_count_cat = row["count_cat"]
label_registered_cat = row["registered_cat"]
label_casual_cat = row["casual_cat"]
return label_count_cat, label_registered_cat, label_casual_cat, Vectors.dense(retArray)
Preparation Apache Spark data frame for linear regression
DUMMY = True
trainingData_dummy = sqlContext.createDataFrame(spark_train_dummy.rdd.map(transformToLabeledPoint), ["label_count", "label_registered", "label_casual", "features_to_filter"])
testData_dummy = sqlContext.createDataFrame(spark_test_dummy.rdd.map(transformToLabeledPoint), ["label_count", "label_registered", "label_casual", "features_to_filter"])
testKaggle_dummy = sqlContext.createDataFrame(spark_Kaggle_test_dummy.rdd.map(transformToLabeledPointKaggle), ["label_count", "label_registered", "label_casual", "features_to_filter"])
chiSq_dummy = sqlContext.createDataFrame(spark_train_dummy.unionAll(spark_test_dummy)\
.rdd.map(transformToLabeledPointChiSqSelector), ["label_count_cat", "label_registered_cat", "label_casual_cat", "features_to_filter"])
Preparation Apache Spark data frame for random forest and gradient boosted regression. No need for dummy variables.
DUMMY = False
trainingData_no_dummy = sqlContext.createDataFrame(spark_train_no_dummy.rdd.map(transformToLabeledPoint), ["label_count", "label_registered", "label_casual", "features_to_filter"])
testData_no_dummy = sqlContext.createDataFrame(spark_test_no_dummy.rdd.map(transformToLabeledPoint), ["label_count", "label_registered", "label_casual", "features_to_filter"])
testKaggle_no_dummy = sqlContext.createDataFrame(spark_Kaggle_test_no_dummy.rdd.map(transformToLabeledPointKaggle), ["label_count", "label_registered", "label_casual","features_to_filter"])
chiSq_no_dummy = sqlContext.createDataFrame(spark_train_no_dummy.unionAll(spark_test_no_dummy)\
.rdd.map(transformToLabeledPointChiSqSelector), ["label_count_cat", "label_registered_cat", "label_casual_cat", "features_to_filter"])
Cache spark data frames into memory for faster computation
trainingData_dummy.cache()
trainingData_dummy.count()
trainingData_no_dummy.cache()
trainingData_no_dummy.count()
testData_dummy.cache()
testData_dummy.count()
testData_no_dummy.cache()
testData_no_dummy.count()
testKaggle_dummy.cache()
testKaggle_dummy.count()
testKaggle_no_dummy.cache()
testKaggle_no_dummy.count()
To treat categorical variables as nominal we must first index them by featureIndexer. Features with number of unique values above maxCategories will be treated as continuous.
from pyspark.ml.feature import VectorIndexer
featureIndexer =\
VectorIndexer(inputCol="features_to_filter", outputCol="indexedFeatures", maxCategories=4) \
.fit(trainingData_no_dummy.unionAll(testData_no_dummy).unionAll(testKaggle_no_dummy))
Use ChiSquare Selector ( https://en.wikipedia.org/wiki/Chi-squared_test ) to choose best features to our models.
feature_selector_dummy = {}
feature_selector_no_dummy = {}
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
for label in ["label_count", "label_registered", "label_casual"]:
selector_dummy = ChiSqSelector( selectorType="fpr", percentile=0.5, fpr=0.05, featuresCol="features_to_filter",
outputCol="features", labelCol=label + "_cat")
# we would like to connect selector with feature indexer so we must fit selector to
# indexed data before we will be able to transform indexed data
selector_no_dummy = ChiSqSelector(selectorType="numTopFeatures", fpr=0.01, numTopFeatures=15, featuresCol="indexedFeatures",
outputCol="features", labelCol=label + "_cat")
selected_features_no_dummy = selector_no_dummy.fit(featureIndexer \
.transform(chiSq_no_dummy))
selected_features_dummy = selector_dummy.fit(chiSq_dummy)
feature_selector_no_dummy[label] = selected_features_no_dummy
feature_selector_dummy[label] = selected_features_dummy
# print selected columns for no_dummy option
print(label + "_no_dummy: ChiSqSelector output with %d features selected below %.2f p-Value" \
% (len(selected_features_no_dummy.selectedFeatures), selector_no_dummy.getFpr()))
for ind in selected_features_no_dummy.selectedFeatures:
print("\t- " + no_dummy_cols[ind])
print("")
Now its time to train and tune our model on training data using k-fold cross validation method!
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor
As mentioned before, LinearRegression for good performance needs data with dummy variables.
def make_prediction(models_dict, dummy=False, Kaggle=False):
selector = feature_selector_no_dummy
# Kaggle test data is used only with no_dummy
if (dummy):
selector = feature_selector_dummy
if (Kaggle):
test_data = testKaggle_dummy
else:
test_data = testData_dummy
elif (Kaggle):
test_data = featureIndexer.transform(testKaggle_no_dummy)
else:
test_data = featureIndexer.transform(testData_no_dummy)
predictionsTestData_r = models_dict['label_registered'].transform(selector['label_registered'].transform(test_data))
predictionsTestData_c = models_dict['label_casual'].transform(selector['label_casual'].transform(test_data))
predictionsTestData_count = models_dict['label_count'].transform(selector['label_count'].transform(test_data))
return predictionsTestData_r, predictionsTestData_c, predictionsTestData_count
Function for evaluation metrics, support both functionalities: rounded predictions and not.
def evaluate_prediction(predictionsTestData_r, predictionsTestData_c, predictionsTestData_count, rounded=True):
print("Evaluation prediction for registred users:")
pandas_pred_r = predictionsTestData_r.toPandas()
pandas_pred_r.loc[(pandas_pred_r['prediction'] < 0), 'prediction'] = 0
rmsle_r = evaluateMetrics(pandas_pred_r, 'label_registered', rounded=rounded)
print()
print("Evaluation prediction for casual users:")
pandas_pred_c = predictionsTestData_c.toPandas()
pandas_pred_c.loc[(pandas_pred_c['prediction'] < 0), 'prediction'] = 0
rmsle_c = evaluateMetrics(pandas_pred_c, 'label_casual', rounded=rounded)
print()
print("Evaluation prediction for sum of both models: registred + casual users:")
pandas_pred_c_undo_log = undo_log_transform(pandas_pred_c,'label_casual')
pandas_pred_c_undo_log.loc[(pandas_pred_c_undo_log['prediction'] < 0), 'prediction'] = 0
pandas_pred_r_undo_log = undo_log_transform(pandas_pred_r,'label_registered')
pandas_pred_r_undo_log.loc[(pandas_pred_r_undo_log['prediction'] < 0), 'prediction'] = 0
pandas_pred_sum = pd.DataFrame()
pandas_pred_sum['label_count'] = pandas_pred_c_undo_log['label_casual'] + pandas_pred_r_undo_log['label_registered']
pandas_pred_sum['prediction'] = pandas_pred_c_undo_log['prediction'] + pandas_pred_r_undo_log['prediction']
rmsle_sum = evaluateMetrics(pandas_pred_sum, 'label_count', rollback=False, rounded=rounded)
print()
print("Evaluation prediction for one count users model:")
pandas_pred_count = predictionsTestData_count.toPandas()
pandas_pred_count.loc[(pandas_pred_count['prediction'] < 0), 'prediction'] = 0
rmsle_count = evaluateMetrics(pandas_pred_count, 'label_count', rounded=rounded)
# We drop features variable in case to mix prediction from dummy and no_dummy models;
# dummy features vector has much more positions than no_dummy, so there would be problem to connect both.
# We just won't need features vector any more.
prediction_dict = {'registered' : undo_log_transform(pandas_pred_r.drop(['features'], axis = 1).copy(),'label_registered'),
'casual': undo_log_transform(pandas_pred_c.drop(['features'], axis = 1).copy(),'label_casual'),
'sum': pandas_pred_sum.copy(),
'count': undo_log_transform(pandas_pred_count.drop(['features'], axis = 1).copy(),'label_count'),
'rmsle_r': rmsle_r,
'rmsle_c': rmsle_c,
'rmsle_sum': rmsle_sum,
'rmsle_count': rmsle_count}
return prediction_dict
Function for mixing predictions from two different models together with respect to specifc ratio rate.
def evaluate_mixed_prediction(rf_pred_dict, bgtr_pred_dict, ratio=0.5, rounded=True):
print("Evaluation mixed (ratio=" + str(ratio) + ") prediction for registred users:")
mixed_pandas_r = ratio * rf_pred_dict['registered'] + (1.0 - ratio) * bgtr_pred_dict['registered']
rmsle_r = evaluateMetrics(mixed_pandas_r, 'label_registered', rollback=False, rounded=rounded)
print()
print("Evaluation mixed (ratio=" + str(ratio) + ") prediction for casual users:")
mixed_pandas_c = ratio * rf_pred_dict['casual'] + (1.0 - ratio) * bgtr_pred_dict['casual']
rmsle_c = evaluateMetrics(mixed_pandas_c, 'label_casual', rollback=False, rounded=rounded)
print()
print("Evaluation mixed (ratio=" + str(ratio) + ") prediction for sum of both models: registred + casual users:")
mixed_pandas_sum = ratio * rf_pred_dict['sum'] + (1.0 - ratio) * bgtr_pred_dict['sum']
rmsle_sum = evaluateMetrics(mixed_pandas_sum, 'label_count', rollback=False, rounded=rounded)
print()
print("Evaluation mixed (ratio=" + str(ratio) + ") prediction for one count users model:")
mixed_pandas_count = ratio * rf_pred_dict['count'] + (1.0 - ratio) * bgtr_pred_dict['count']
rmsle_count = evaluateMetrics(mixed_pandas_count, 'label_count', rollback=False, rounded=rounded)
mixed_prediction_dict = {'registered' : mixed_pandas_r.copy(),
'casual': mixed_pandas_c.copy(),
'sum': mixed_pandas_sum.copy(),
'count': mixed_pandas_count.copy(),
'rmsle_sum' : rmsle_sum}
return mixed_prediction_dict
Function for final prediction for Kaggle test dataset. Support mixing prediction with specific ratio. Final sum of predictions is rounded.
def predict_Kaggle_test(rf_predictionsTestData_r, rf_predictionsTestData_c, \
gbtr_predictionsTestData_r, gbtr_predictionsTestData_c, ratio=0.5):
# 1. Although Kaggle test dataset does not have labels, all prediction are Spark dataframes
# with schema ["label_count", "label_registered", "label_casual", "features"]
# for coherent format with training data which enables us to use Spark featureIndexer during traingin models
# featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(trainingData_no_dummy.unionAll(testData_no_dummy).unionAll(testKaggle_no_dummy))
#
# 2. We used column 'label_count' for convenient cache datatime variable used for submission file format
# pandas_pred_sum.rename(columns={"label_count": "datetime"})
#
# 3. Predicted values have to be numerical
# 4. labelCol is set to '' becouse Kaggle test dataset doesn't have one
rf_pandas_pred_r = undo_log_transform(rf_predictionsTestData_r.toPandas().drop(['features'], axis = 1), labelCol='', Kaggle=True)
rf_pandas_pred_c = undo_log_transform(rf_predictionsTestData_c.toPandas().drop(['features'], axis = 1), labelCol='', Kaggle=True)
gbtr_pandas_pred_r = undo_log_transform(gbtr_predictionsTestData_r.toPandas().drop(['features'], axis = 1), labelCol='', Kaggle=True)
gbtr_pandas_pred_c = undo_log_transform(gbtr_predictionsTestData_c.toPandas().drop(['features'], axis = 1), labelCol='', Kaggle=True)
rf_pandas_pred_r.loc[(rf_pandas_pred_r['prediction'] < 0), 'prediction'] = 0
rf_pandas_pred_c.loc[(rf_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
gbtr_pandas_pred_r.loc[(gbtr_pandas_pred_r['prediction'] < 0), 'prediction'] = 0
gbtr_pandas_pred_c.loc[(gbtr_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
pandas_pred_sum = rf_pandas_pred_r.copy()
pandas_pred_sum['count'] = 0.0
pandas_pred_sum['count'] = ratio * (rf_pandas_pred_r['prediction'] + rf_pandas_pred_c['prediction']) + \
(1.0 - ratio) * (gbtr_pandas_pred_r['prediction'] + gbtr_pandas_pred_c['prediction'])
pandas_pred_sum['count'] = np.round(pandas_pred_sum['count']).astype(int)
# other option to truncate floating point number instead of round
# pandas_pred_sum['count'] = pandas_pred_sum['count'].astype(int)
return pandas_pred_sum.rename(columns={"label_count": "datetime"})[['datetime','count']]
def predict_count_Kaggle_test(rf_predictionsTestData_c, gbtr_predictionsTestData_c, ratio=0.5):
# 1. Although Kaggle test dataset does not have labels, all prediction are Spark dataframes
# with schema ["label_count", "label_registered", "label_casual", "features"]
# for coherent format with training data which enables us to use Spark featureIndexer during traingin models
# featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(trainingData_no_dummy.unionAll(testData_no_dummy).unionAll(testKaggle_no_dummy))
#
# 2. We used column 'label_count' for convenient cache datatime variable used for submission file format
# pandas_pred_sum.rename(columns={"label_count": "datetime"})
#
# 3. Predicted values have to be numerical
# 4. labelCol is set to '' becouse Kaggle test dataset doesn't have one
rf_pandas_pred_c = undo_log_transform(rf_predictionsTestData_c.toPandas().drop(['features'], axis = 1), labelCol='', Kaggle=True)
gbtr_pandas_pred_c = undo_log_transform(gbtr_predictionsTestData_c.toPandas().drop(['features'], axis = 1), labelCol='', Kaggle=True)
rf_pandas_pred_c.loc[(rf_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
gbtr_pandas_pred_c.loc[(gbtr_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
pandas_pred_sum = rf_pandas_pred_c.copy()
pandas_pred_sum['count'] = 0.0
pandas_pred_sum['count'] = ratio * (rf_pandas_pred_c['prediction']) + \
(1.0 - ratio) * (gbtr_pandas_pred_c['prediction'])
pandas_pred_sum['count'] = np.round(pandas_pred_sum['count']).astype(int)
# other option to truncate floating point number instead of round
# pandas_pred_sum['count'] = pandas_pred_sum['count'].astype(int)
return pandas_pred_sum.rename(columns={"label_count": "datetime"})[['datetime','count']]
We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance. This will allow us to jointly choose parameters for all Pipeline stages. A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator. We use a ParamGridBuilder to construct a grid of parameters to search over.
| regularizer $R(w)$ | gradient or sub-gradient | |
|---|---|---|
| zero (unregularized) | 0 | 0 |
| L2 | $\frac{1}{2}\|w\|^2$ | $w$ |
| L1 | $\|w\|$ | $\mathrm{sign}(w)$ |
| elastic net | $\alpha \|w\| + (1-\alpha)\frac{1}{2}\|w\|^2$ | $\alpha \mathrm{sign}(w) + (1-\alpha) w$ |
# lr = LinearRegression(maxIter=1000)
# lr_cv_models_dict = {}
# regParam = [0.01, 0.001, 0.0001, 0.00001]
# elasticNetParam = [0, 0.25, 0.5, 0.75, 1.0]
# pipeline = Pipeline(stages=[lr])
# for label in ['label_registered','label_casual', 'label_count']:
# paramGrid = ParamGridBuilder() \
# .addGrid(lr.regParam, regParam) \
# .addGrid(lr.elasticNetParam, elasticNetParam) \
# .addGrid(lr.labelCol, [label]) \
# .build()
# crossval = CrossValidator(estimator=pipeline,
# estimatorParamMaps=paramGrid,
# evaluator=RegressionEvaluator(labelCol=label, metricName="rmse"),
# numFolds=5) # (5x5)x5 = 125 models to check
# start = time.time()
# # Run cross-validation, and choose the best set of parameters on dummy dataset.
# # trainingData_dummy + testData_dummy = whole Kaggle Train data, we separarated this sets to simply evaluate models
# # we cross-validate on whole Kaggle train data to find best params to predict whole Kaggle Test set
# # use only features chosen before by ChiSqSelector
# cvModel = crossval.fit(feature_selector_dummy[label].transform(trainingData_dummy.unionAll(testData_dummy)))
# print("======= CV for " + label + " =========")
# end = time.time()
# print("Time taken to train model with k-fold cv: " + str(end - start) + " seconds")
# bestModel = cvModel.bestModel.stages[0]._java_obj
# print("RegParam: " + str(bestModel.getRegParam()))
# print("ElasticNetParam: " + str(bestModel.getElasticNetParam()))
# lr_cv_models_dict[label] = cvModel
# print()
Now that we have fit our model, let's evaluate its performance by predicting off the test values with the best values from cross validation!.
# lr_predictionsTestData_r, lr_predictionsTestData_c, lr_predictionsTestData_count = make_prediction(lr_cv_models_dict, dummy=True)
Let's evaluate our model performance by calculating the evaluation metrics. Calculate the Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and the Root Mean Squared Log Error . Refer cell above for the formulas.
# lr_pred_dict = evaluate_prediction(lr_predictionsTestData_r, lr_predictionsTestData_c, lr_predictionsTestData_count)
# coefficients = pd.DataFrame(LinearRegression(regParam=0.00001, elasticNetParam=0.25, maxIter=1000)\
# .setLabelCol('label_count')
# .fit(feature_selector_dummy[label].transform(trainingData_dummy))
# .coefficients.array,
# [dummy_cols[i] for i in feature_selector_dummy[label].selectedFeatures])
# coefficients.columns = ['Coefficient']
Hour variables have the biggest coefficients values in linera regression equation. Amongs them, peak hours have the biggest one.
# plt.figure(figsize=(12,20))
# sns.heatmap(coefficients, cmap='coolwarm')
Let's quickly explore the residuals to make sure everything was okay with our data. It's good idea to plot a histogram of the residuals and make sure it looks normally distributed.
# plt.scatter(lr_pred_dict['sum']['label_count'],lr_pred_dict['sum']['prediction'])
# plt.xlabel('Y Test')
# plt.ylabel('Predicted Y')
# plt.title('Linear Regression')
Create a scatterplot of the real test values versus the predicted values.
# sns.distplot((lr_pred_dict['sum']['label_count'] - lr_pred_dict['sum']['prediction']),bins=50);
# plt.xlabel('Linear Regression Residuals')
Decision trees are a popular family of classification and regression methods. We will use them as step before ensemled models training, to verify feature selection from ChiSqSelector for no_dummy option as well as to verify new subsets of features to minimalize loss function.
Using all features - model relies on indexedFeatures insted of features (filtered).
# Train a DecisionTree model.
dt = DecisionTreeRegressor(maxDepth=15, minInstancesPerNode=10, featuresCol="indexedFeatures")
dt_cv_models_dict = {}
for label in ['label_registered','label_casual', 'label_count']:
pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], dt.setLabelCol(label)])
model = pipeline.fit(trainingData_no_dummy)
dt_cv_models_dict[label] = model
dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count = make_prediction(dt_cv_models_dict, dummy=False)
dt_pred_dict = evaluate_prediction(dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count)
Using all features we minimalized RMSLE metric to 0.3677.
Let's find out whether on the same model params, using 15 features from ChiSqSelector, we will get better result.
# Train a DecisionTree model.
dt = DecisionTreeRegressor(maxDepth=15, minInstancesPerNode=10, featuresCol="features")
dt_cv_models_dict = {}
for label in ['label_registered','label_casual', 'label_count']:
pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], dt.setLabelCol(label)])
model = pipeline.fit(trainingData_no_dummy)
dt_cv_models_dict[label] = model
dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count = make_prediction(dt_cv_models_dict, dummy=False)
dt_pred_dict = evaluate_prediction(dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count)
fig,axes= plt.subplots(nrows=3, ncols=2, figsize=(11,10))
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['count'], bins = 24, ax=axes[0][0])
sns.distplot(dt_pred_dict['count']['prediction'], bins = 24, ax=axes[0][1])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['casual'], bins = 24, ax=axes[1][0])
sns.distplot(dt_pred_dict['casual']['prediction'], bins = 24, ax=axes[1][1])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['registered'], bins = 64, ax=axes[2][0])
sns.distplot(dt_pred_dict['registered']['prediction'], bins = 64, ax=axes[2][1])
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout
Overall resutls are slightly better. Let's find out how number of chosen features impacts final metric.
dt = DecisionTreeRegressor(maxDepth=25, minInstancesPerNode=5, featuresCol="features")
dt_cv_models_dict = {}
rmsle_registered = []
rmsle_casual = []
rmsle_sum = []
rmsle_count = []
for num in range(1,len(no_dummy_cols) + 1):
for label in ["label_count", "label_registered", "label_casual"]:
selector_no_dummy = ChiSqSelector(selectorType="numTopFeatures", fpr=0.01, numTopFeatures=num, \
featuresCol="indexedFeatures",outputCol="features", labelCol=label + "_cat")
selected_features_no_dummy = selector_no_dummy.fit(featureIndexer \
.transform(chiSq_no_dummy))
pipeline = Pipeline(stages=[featureIndexer, selected_features_no_dummy, dt.setLabelCol(label)])
model = pipeline.fit(trainingData_no_dummy)
dt_cv_models_dict[label] = model
dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count = make_prediction(dt_cv_models_dict, dummy=False)
dt_pred_dict = evaluate_prediction(dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count)
rmsle_registered.append(dt_pred_dict['rmsle_r'])
rmsle_casual.append(dt_pred_dict['rmsle_c'])
rmsle_sum.append(dt_pred_dict['rmsle_sum'])
rmsle_count.append(dt_pred_dict['rmsle_count'])
From the plot below it's clear that 15 features with the smallest p-Value do the job. Increasing number of features doesn't lead to any metric improvemnt and there is a risk of overfitting. Feature number 7 for registered, sum and count is 'hour' - has the biggest impact on minimizing final score. The same variable 'hour' for casual has index number 9 and has the biggest impact as well.
for ind in feature_selector_no_dummy['label_count'].selectedFeatures:
print("\t- " + no_dummy_cols[ind])
fig = plt.figure(figsize=(12,5))
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_registered, label='Registered')
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_casual , label='Casual')
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_sum, label='Sum', c='red')
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_count, label='Count', marker='o')
plt.xlabel("Number of sorted features by p-Value (ascending)")
plt.ylabel("RMSLE")
plt.title("Trade off - number of features used by model")
plt.legend()
Random forests are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data.
The randomness injected into the training process includes:
Apart from these randomizations, decision tree training is done in the same way as for individual decision trees.To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.For regression each tree predicts a real value. The label is predicted to be the average of the tree predictions.
# model = RandomForestRegressor(featuresCol="indexedFeatures")
# numTrees = [50, 100, 200, 400, 800, 1000, 2000]
# maxDepth = [15, 20, 25, 30]
# minInstancesPerNode = [5, 10, 15]
# maxBins = [32, 64, 128]
# rf_cv_models_dict = {}
# for label in ['label_registered','label_casual', 'label_count']:
# pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], model])
# paramGrid = ParamGridBuilder() \
# .addGrid(model.numTrees, numTrees) \
# .addGrid(model.maxDepth, maxDepth) \
# .addGrid(model.minInstancesPerNode, minInstancesPerNode) \
# .addGrid(model.maxBins, maxBins) \
# .addGrid(model.labelCol, [label]) \
# .build()
# crossval = CrossValidator(estimator=pipeline,
# estimatorParamMaps=paramGrid,
# evaluator=RegressionEvaluator(labelCol=label, metricName="rmse"),
# numFolds=4) # (5x5)x5 = 125 models to check
# start = time.time()
# cvModel = crossval.fit(trainingData_no_dummy.unionAll(testData_no_dummy))
# print("======= CV for " + label + " =========")
# end = time.time()
# print("Time taken to train model with k-fold cv: " + str(end - start) + " seconds")
# bestModel = cvModel.bestModel.stages[0]._java_obj
# print("NumTrees: " + str(bestModel.getNumTrees()))
# print("MaxDepth: " + str(bestModel.getMaxDepth()))
# print("MinInstancesPerNode: " + str(bestModel.getMinInstancesPerNode()))
# print("MaxBins: " + str(bestModel.getMaxBins()))
# rf_cv_models_dict[label] = cvModel
# print()
Create models with params from k-fold cross validation above manually becouse cross validation is very time consuming.
rf_cv_models_dict = {}
rfr = RandomForestRegressor(numTrees=1000, maxDepth=25, minInstancesPerNode=10, featuresCol="features")
for label in ['label_registered','label_casual', 'label_count']:
pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], rfr.setLabelCol(label)])
model = pipeline.fit(trainingData_no_dummy)
rf_cv_models_dict[label] = model
rf_predictionsTestData_r, rf_predictionsTestData_c, rf_predictionsTestData_count = make_prediction(rf_cv_models_dict, dummy=False)
rf_pred_dict = evaluate_prediction(rf_predictionsTestData_r, rf_predictionsTestData_c, rf_predictionsTestData_count)
plt.scatter(rf_pred_dict['sum']['label_count'],rf_pred_dict['sum']['prediction'])
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.title('Random Forest Regression')
sns.distplot((rf_pred_dict['sum']['label_count'] - rf_pred_dict['sum']['prediction']),bins=50);
plt.xlabel('Random Forest Regression Residuals')
Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:
In short, both algorithms can be effective, and the choice should be based on the particular dataset.
# model = GBTRegressor(featuresCol="indexedFeatures")
# minInstancesPerNode = [7,8,9,10,11,12,15]
# maxDepth = [4,5,6,7,10,15]
# maxIter = [100, 150, 300, 500]
# maxBins = [32, 64, 128]
# gbtr_cv_models_dict = {}
# for label in ['label_registered','label_casual', 'label_count']:
# pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], model])
# paramGrid = ParamGridBuilder() \
# .addGrid(model.minInstancesPerNode, minInstancesPerNode) \
# .addGrid(model.maxDepth, maxDepth) \
# .addGrid(model.maxIter, maxIter) \
# .addGrid(model.maxBins, maxBins) \
# .addGrid(model.labelCol, [label]) \
# .build()
# crossval = CrossValidator(estimator=pipeline,
# estimatorParamMaps=paramGrid,
# evaluator=RegressionEvaluator(labelCol=label, metricName="rmse"),
# numFolds=4) # (5x5)x5 = 125 models to check
# start = time.time()
# cvModel = crossval.fit(trainingData_no_dummy.unionAll(testData_no_dummy))
# print("======= CV for " + label + " =========")
# end = time.time()
# print("Time taken to train model with k-fold cv: " + str(end - start) + " seconds")
# bestModel = cvModel.bestModel.stages[0]._java_obj
# print("MinInstancesPerNode: " + str(bestModel.getMinInstancesPerNode()))
# print("MaxDepth: " + str(bestModel.getMaxDepth()))
# print("MaxIter: " + str(bestModel.getMaxIter()))
# print("MaxBins: " + str(bestModel.getMaxBins()))
# gbtr_cv_models_dict[label] = cvModel
# print()
gbtr_cv_models_dict = {}
#gbt = GBTRegressor(maxDepth=5, minInstancesPerNode=10, maxMemoryInMB=512, maxBins=32, maxIter=150, subsamplingRate=0.7, featuresCol="indexedFeatures")
gbt = GBTRegressor(maxDepth=5, minInstancesPerNode=10, maxMemoryInMB=512, maxBins=128, maxIter=250, featuresCol="features")
for label in ['label_registered','label_casual', 'label_count']:
pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], gbt.setLabelCol(label)])
model = pipeline.fit(trainingData_no_dummy)
gbtr_cv_models_dict[label] = model
gbtr_predictionsTestData_r, gbtr_predictionsTestData_c, gbtr_predictionsTestData_count = make_prediction(gbtr_cv_models_dict, dummy=False)
gbtr_pred_dict = evaluate_prediction(gbtr_predictionsTestData_r, gbtr_predictionsTestData_c, gbtr_predictionsTestData_count)
plt.scatter(gbtr_pred_dict['sum']['label_count'],gbtr_pred_dict['sum']['prediction'])
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.title('Gradient Boosted Tree Regression')
sns.distplot((gbtr_pred_dict['sum']['label_count'] - gbtr_pred_dict['sum']['prediction']),bins=50);
plt.xlabel('Gradient Boosted Tree Regression Residuals')
We trained three machine learning algorithms. We used K-fold cross validation for parameter tuning. Of course we couldn't check all posibly combination due to computancy limitation. Especially Gradient Boosted Regression was hard to train becouse of its iterative manner. On the other hand it predicts very fast - this is one of the reason of its popularity (ex. for page rank cases). As we might suspect, non-linearity models gave us better results. Gradient Boosted Regression won the game with the score 0.3107 !
RMSLE |
||||
|---|---|---|---|---|
| registered | casual | registered + casual | count | |
| Linear Regression | 0.5539 |
0.5867 | 0.5333 |
0.5472 |
| Random Forest Regression | 0.3445 |
0.4889 | 0.3409 |
0.3614 |
| Gradient Boosted Regression | 0.3117 |
0.5089 | 0.3107 |
0.3264 |
Training two models to predict casual and registred rentals accordingly gave us slightly better results. In all models prediction for casual users were significant worse than for registered users. It follows that there is no strict rule for renting cars for casual users / there is some other variable, not present in our dataset, that could better model casual renting.
RMSLE can be used when we don’t want to penalize huge differences when both the values are huge numbers, on the other hand it's harder to compare those results to others. That's why we also measured $R^2$ metric which in a way is standariezed.
$R^2$ |
||||
|---|---|---|---|---|
| registered | casual | registered + casual | count | |
| Linear Regression | 0.8041 |
0.7262 | 0.8085 |
0.8019 |
| Random Forest Regression | 0.9143 |
0.8923 | 0.9204 |
0.9109 |
| Gradient Boosted Regression | 0.9373 |
0.8941 | 0.9427 |
0.9293 |
Depends on $R^2$ value we assume that compatibility is:
It's worth to notice that combinig two models results in better accuracy than each of them had separately.
RMSLE |
||||
|---|---|---|---|---|
| registered | casual | registered + casual | count | |
| Random Forest Regression | 0.3117 |
0.5089 |
0.3107 |
0.3264 |
So it's good idea to combine prediction from two best machine learning algorithms: Random Forest Regression and Gradient Boosted Regression
We need to plot curve ratio to RMSLE to verify the split proportion
rmsle_list = []
ratio_array = np.linspace(0.05,1,20)
for ratio in ratio_array:
rmsle_list.append(evaluate_mixed_prediction(rf_pred_dict, gbtr_pred_dict, ratio=float(ratio))['rmsle_sum'])
We can see that split proportion around 0.3 gives us best result.
plt.plot(ratio_array, rmsle_list)
plt.xlabel('Ratio')
plt.ylabel('RMSLE')
plt.title('Ratio * RandomForest + (1 - Ratio) * GradientBoostedTree')
'%.3f' % min(rmsle_list)
Get the best ratio split.
# best_ratio = float(ratio_array[rmsle_list.index(min(rmsle_list))])
best_ratio = float(0.2)
# featureIndexer has to be fitted to all data (train and test) in order to work properly
featureIndexer =\
VectorIndexer(inputCol="features_to_filter", outputCol="indexedFeatures", maxCategories=4)\
.fit(trainingData_no_dummy.unionAll(testData_no_dummy).unionAll(testKaggle_no_dummy))
Previous models were trained on 0.8 part of whole train data set. Now we need to train on whole dataset with chosen params.
# model params taken from previous cv
rfr = RandomForestRegressor(numTrees=1000, maxDepth=15, minInstancesPerNode=10, featuresCol="features")
rf_cv_models_dict_final = {}
for label in ['label_registered','label_casual', 'label_count']:
pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], rfr.setLabelCol(label)])
model = pipeline.fit(trainingData_no_dummy.unionAll(testData_no_dummy))
rf_cv_models_dict_final[label] = model
# model params taken from previous cv
gbt = GBTRegressor(maxDepth=5, minInstancesPerNode=10, maxIter=150, featuresCol="features")
gbtr_cv_models_dict_final = {}
for label in ['label_registered','label_casual', 'label_count']:
pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[label], gbt.setLabelCol(label)])
model = pipeline.fit(trainingData_no_dummy.unionAll(testData_no_dummy))
gbtr_cv_models_dict_final[label] = model
gbtr_predictionsTestData_r, gbtr_predictionsTestData_c, gbtr_predictionsTestData_count = make_prediction(gbtr_cv_models_dict_final, dummy=False, Kaggle=True)
rf_predictionsTestData_r, rf_predictionsTestData_c, rf_predictionsTestData_count = make_prediction(rf_cv_models_dict_final, dummy=False, Kaggle=True)
Get 'raw' prediction on Kaggle test dataset from best two models: Random Forest Regression and Gradient Boosted Regression Tree.
# kaggleSubmission = predict_Kaggle_test(rf_predictionsTestData_r, rf_predictionsTestData_c, \
# gbtr_predictionsTestData_r, gbtr_predictionsTestData_c, ratio = best_ratio)
kaggleSubmission = predict_count_Kaggle_test(rf_predictionsTestData_count, gbtr_predictionsTestData_count, ratio = best_ratio)
kaggleSubmission['datetime'] = pd.to_datetime(kaggleSubmission['datetime'])
kaggleSubmission.head(10)
kaggleSubmission.info()
kaggleSubmission.to_csv('../data/kaggleSubmission.csv', index=False)
Due to computation and time limitations there are ares where additional research could be done: